SUMMARY

This analysis is based on the premise that widening political divisions are not just about “what to do” but, more fundamentally, about perceptions of “what the priorities are.”
Perceptions of priorities are influenced, in part, by how frequently we hear about an issue. In turn, those who speak about the issues we care about tend to attract our attention. The nature of this positive feedback loop is the subject of exploration here: how can we detect, using statistics, systematic differences in language that reflect priorities and biases?
Here, using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the texts of recent Republican and Democratic presidential debates. Key findings are:
1. “Wordcloud” visualizations reveal differences between candidates, though the similarities are just as surprising.
2. Frequency analysis of keywords highlights strong differences between candidates, but misses important context.
3. Bigram tokenization and word-stem searches begin to reveal subtleties of meaning.

DATA SOURCES AND METHODS

The texts of the presidential debates were downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point, all processing is done in R using the capabilities of {tm} and associated libraries.
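The loading step can be sketched as follows; the corpus-cleaning calls are the standard {tm} transformations, while the in-line text is a stand-in for the actual transcript files (an assumption, since the file names are not given in the text):

```r
library(tm)

# In practice each element would be readLines() from one of the stored
# .txt transcript files; a tiny in-line stand-in keeps the sketch
# self-contained.
texts <- c("We are going to build a wall.",
           "The American people want health care.")

corpus <- VCorpus(VectorSource(texts))

# Standard {tm} cleaning: lower-case, strip punctuation and numbers,
# drop common English stopwords, collapse whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
```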

CANDIDATE WORD-CLOUDS

The quickest and most visual method to compare texts is word-frequency analysis using wordclouds. Not surprisingly, word choices vary between candidates. However, there are also some striking similarities.

Let’s first compare the word clouds of candidates using the {wordcloud} package.
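The c_wordcloud() helper used throughout is not shown in the post; one plausible implementation, assuming it takes a character vector of a candidate's dialogue, might look like this:

```r
library(tm)
library(wordcloud)

# Hypothetical body for the c_wordcloud() helper: count terms with {tm}
# and pass the sorted frequencies to wordcloud().
c_wordcloud <- function(text) {
  corpus <- VCorpus(VectorSource(paste(text, collapse = " ")))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm  <- TermDocumentMatrix(corpus)
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freq), freq, min.freq = 2, random.order = FALSE)
}
```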

TRUMP V. SANDERS

Differences between Donald Trump’s and Bernie Sanders’s dialogue at the debates are evident. Bernie’s word cloud is larger because he spoke more total words. Despite the differences, however, what is most surprising is the similarity of the clouds; word choices like people, country, and going are common to both. Despite strong differences in policy, these word clouds reveal little about them.

c_wordcloud(trump_all)

c_wordcloud(sanders_all)

HILLARY V. CARLY

In this case the word clouds couldn’t be more different. Hillary’s emphasizes think and people while Carly’s, a former businesswoman, primarily emphasizes government. However, there is no context to judge, for instance, what Ms. Fiorina’s opinions or sentiments toward government are. Note again that Hillary’s wordcloud is larger than Ms. Fiorina’s.

c_wordcloud(clinton_all)

c_wordcloud(fiorina_all)

CRUZ V. HUCKABEE

Ted Cruz’s wordcloud seems to emphasize wonkish financial technicalities, like taxes and washington, while that of Mike Huckabee, a former minister, mixes the language of Mr. Trump and Ms. Fiorina together. Again, no sentiment can be extracted.

c_wordcloud(cruz_all)

c_wordcloud(huckabee_all)

STAYING ON MESSAGE: COMPARING DEBATES

We can also split the text by debate. Since the debates cover different topics and questions, one might expect to see this reflected in the text of the separate dialogues. What’s surprising here is how comparable the language of each candidate is between the debates. Perhaps the candidates are more interested in staying on message than answering questions directly?
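The candidate_text_tc() helper used below is likewise not shown in the post. A minimal sketch, assuming the transcripts tag each speaking turn with an upper-case "NAME:" label, could be:

```r
# Hypothetical implementation: split the transcript into speaker turns
# at upper-case "NAME:" tags, keep the turns for the requested
# candidate, and return them concatenated as one string.
candidate_text_tc <- function(name, transcript) {
  turns <- unlist(strsplit(transcript, "(?=\\b[A-Z]+:)", perl = TRUE))
  mine  <- grep(paste0("^", name, ":"), turns, value = TRUE)
  paste(sub("^[A-Z]+:\\s*", "", mine), collapse = " ")
}

candidate_text_tc("TRUMP",
  "TRUMP: We will win. SANDERS: People demand change. TRUMP: Believe me.")
```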

c_wordcloud(candidate_text_tc("TRUMP", r_oct))

c_wordcloud(candidate_text_tc("TRUMP", r_nov))
c_wordcloud(candidate_text_tc("SANDERS", d_oct))

c_wordcloud(candidate_text_tc("SANDERS", d_nov))

WORD FREQUENCY

We can check word frequency directly by simply tokenizing the text and counting single words. In a sense this is equivalent to the wordcloud analysis, but it is more quantitative. For this analysis, some additional filler words like “thats”, “dont”, “back”, “can”, “get”, “cant”, and “come” are suppressed.
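The counting step looks roughly like this; the filler-word list comes from the text, while the sample sentence is purely illustrative:

```r
library(tm)

# Extra filler words suppressed on top of the standard English stopwords.
extra_stop <- c("thats", "dont", "back", "can", "get", "cant", "come")

text   <- "People think people can get going dont know the country"
corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), extra_stop))

# Tokenize to single words and count.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq
```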

This table shows the most frequent words used by each candidate.

word        trump  sanders  clinton  fiorina  sum
think           9       55       90        9  163
know           23       26       56       19  124
well            9       31       56        8  104
people         33       85       53       10  181
government      0        7        6       40   53
every           4       15        9       26   54
need            5       33       36       18   92
country        34       70       25        1  130
going          44       44       45       10  143

Word counts differ widely, reflecting the vocabulary choices made by each candidate. It is also apparent that the number of words spoken by each candidate differed greatly, due to the larger number of Republican candidates (roughly ten) than Democratic ones (roughly three). Indeed, the total number of words spoken by Carly Fiorina was 1580, and her vocabulary of distinct words was 702. By comparison, Bernie Sanders spoke 4314 total words, with a vocabulary of 1375 words.

GRAPHICAL REPRESENTATION

From the above, there may be information in comparing words used frequently by one candidate to the frequency of their use by another. Below is a graph of the “top” words used by all candidates. Because total word counts differ, we need to normalize the word counts, \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\).
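In code, the normalization is a one-liner. The counts below are a three-word subset of Trump's column from the table above, used purely for illustration; a real run divides by the candidate's full word total:

```r
# Normalize raw counts W_i to frequencies nu_i = W_i / sum_k W_k.
counts <- c(people = 33, country = 34, going = 44)
nu <- counts / sum(counts)
round(nu, 3)
```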

In the graph below the \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.

This starts to be much more informative. For instance, Carly Fiorina mentions the word “government” as almost two percent of her word usage, whereas Donald Trump hardly mentions the word at all. Or notice that both Bernie Sanders and Donald Trump mention the word “wall” more than their competitors while Bernie Sanders alone mentions the word “street” with comparably high frequency.

NORMALIZED Z STATISTICS

The above doesn’t reveal much more information than the wordcloud analysis does. However, we can also pick some “key words” and sample their frequencies. For a first stab, let’s try

key_words = c("tax", "government", "climate", "class", "wall", "street", "terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence")

##       Row.names        trump     sanders      clinton     fiorina all rank
## 1062 government 0.0000000000 0.001622624 0.0012992638 0.025316456  53    1
## 2671       wall 0.0065281899 0.006722299 0.0023819835 0.001265823  53    2
## 2447        tax 0.0071216617 0.002549838 0.0008661758 0.010759494  44    3
## 2377     street 0.0005934718 0.006490496 0.0025985275 0.001265823  43    4
## 1605      money 0.0065281899 0.004404265 0.0006496319 0.004430380  40    5
## 1128     health 0.0000000000 0.004172462 0.0030316154 0.001898734  35    6
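A table like the one above can be produced by treating each key word as a stem and grepping it against the row names of the normalized frequency matrix. A minimal sketch, with a made-up two-row matrix standing in for the full TDM-derived one:

```r
# Hypothetical normalized frequency matrix (rows = words, cols = candidates);
# the numbers are illustrative, not the real values.
freq_mat <- rbind(
  wall   = c(trump = 0.0065, sanders = 0.0067, clinton = 0.0024, fiorina = 0.0013),
  nation = c(trump = 0.0010, sanders = 0.0008, clinton = 0.0012, fiorina = 0.0005)
)

key_words <- c("wall", "tax")

# Stem search: keep any row whose name contains one of the key words.
rows <- unique(unlist(lapply(key_words, function(w)
  grep(w, rownames(freq_mat), ignore.case = TRUE))))
freq_mat[rows, , drop = FALSE]
```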

WORD ASSOCIATIONS FROM BIGRAM TOKENIZATION

Since word frequency does not convey specific positions on issues, let’s look at word associations to see if we can get closer to meaning with more information about the context of word usage. This analysis simply tokenizes the text as bigrams, then uses a simple function

bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE)]

to pull out relevant terms from the tokenized TDM. A key challenge is that the texts are relatively short, so the statistics comparing word frequencies are poor. Nevertheless, we can see that the context around different words, even at the relatively unsophisticated level of simple bigrams, starts to hint at differences in approach to problems.
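A base-R sketch of the bigram step follows; in the post this is presumably done with a custom tokenizer passed to {tm}'s TermDocumentMatrix, but the counting logic is the same. Here the result is a named table rather than a TDM, so names() replaces the rownames() call shown above:

```r
# Tokenize a text into bigrams (adjacent word pairs) and count them.
bigrams_of <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < 2) return(table(character(0)))
  table(paste(head(words, -1), tail(words, -1)))
}

bigram_table <- bigrams_of("wall street banks and the wall on the border")

# The lookup from the text, adapted to the named table.
bigram_table[grep("wall", names(bigram_table), ignore.case = TRUE)]
```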

CANDIDATES ON TAXES

Bernie Sanders talks about “tax” (and “terror”) as well. His discussion of taxes has a reformist bent, but where Carly Fiorina associates words like budgeting, changes, simplify, code, reform, and plan, Bernie Sanders associates words like cap, income, must, share, speculation, breaks, reform, wall, and rebuilding.